Introduction
This report summarizes the output of the STAR program, which aligned the sequence reads of a set of RNA-seq libraries. It uses information from the Log.final.out file that STAR generates for each RNA-seq library. Its primary goal is to evaluate the consistency of several summary statistics, such as alignment rate and mismatch frequency, across multiple RNA-seq libraries.
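As a rough illustration of how the statistics in a Log.final.out file could be read into R, the sketch below parses its "key | value" lines into a named vector. The function name `parse_star_log` and the numbers in the example lines are hypothetical, and this is not necessarily how the report itself reads the files.

```r
# Parse the "key | value" lines of a STAR Log.final.out file into a named
# character vector. `lines` is a character vector, e.g. from readLines().
# parse_star_log is a hypothetical helper, not part of STAR or RoCA.
parse_star_log <- function(lines) {
  lines <- lines[grepl("|", lines, fixed = TRUE)]  # drop section headers
  parts <- strsplit(lines, "|", fixed = TRUE)
  keys  <- trimws(vapply(parts, `[`, character(1), 1))
  vals  <- trimws(vapply(parts, `[`, character(1), 2))
  setNames(vals, keys)
}

# Illustrative lines in the layout STAR uses (values are made up).
example_log <- c(
  "                          Number of input reads |\t34480000",
  "                                  UNIQUE READS:",
  "                        Uniquely mapped reads % |\t83.60%",
  "                      Mismatch rate per base, % |\t0.30%"
)

stats <- parse_star_log(example_log)

# Convert the fields of interest to numbers.
input_millions <- as.numeric(stats[["Number of input reads"]]) / 1e6
unique_pct <- as.numeric(sub("%", "", stats[["Uniquely mapped reads %"]], fixed = TRUE))
```

In practice, `lines` would come from `readLines()` on each library's Log.final.out file, and the parsed vectors would be combined into one table with a row per library.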
RNA-seq data in 3 immune cells of 4 donors
STAR was run using the following options:
Table 1. Overview of the summary statistics across all libraries, including the total number of sequence reads, the percentage of uniquely mapped reads, etc. The summary statistics of individual libraries are available in a separate table.
| Statistic | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Total input, million reads | 30.74 | 32.7000 | 34.480 | 36.3400 | 40.3900 | 44.52 |
| Alignment rate (%), unique mapping | 80.45 | 82.9800 | 83.600 | 84.1800 | 85.3400 | 88.79 |
| Alignment rate (%), unique + multiple | 92.64 | 93.3500 | 93.640 | 93.6200 | 93.8700 | 94.52 |
| Mismatch rate (%) | 0.27 | 0.2800 | 0.295 | 0.2992 | 0.3200 | 0.33 |
| Deletion rate (%) | 0.02 | 0.0200 | 0.020 | 0.0200 | 0.0200 | 0.02 |
| Insertion rate (%) | 0.01 | 0.0100 | 0.010 | 0.0100 | 0.0100 | 0.01 |
| Too many loci (%) | 0.05 | 0.0675 | 0.110 | 0.1033 | 0.1325 | 0.15 |
| Too many mismatch (%) | 0.09 | 0.1000 | 0.110 | 0.1125 | 0.1200 | 0.14 |
| Too short (%) | 5.23 | 5.7850 | 6.055 | 6.0860 | 6.4050 | 7.10 |
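The column layout of Table 1 matches what R's `summary()` returns for a numeric vector. A minimal sketch of how such a table could be built from per-library statistics is shown below; the data frame `lib_stats` and its values are hypothetical and are not the data behind Table 1.

```r
# Hypothetical per-library statistics (illustrative values only).
lib_stats <- data.frame(
  total_reads_million = c(30.7, 32.1, 34.5, 36.0, 40.2, 44.5),
  unique_pct          = c(80.5, 82.9, 83.6, 84.0, 85.3, 88.8)
)

# summary() on a numeric vector returns Min., 1st Qu., Median, Mean,
# 3rd Qu., and Max.; sapply() applies it to each column, and t() puts
# one statistic per row, as in Table 1.
summary_table <- t(sapply(lib_stats, summary))
print(summary_table)
```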
In most RNA-seq data sets, the percentage of total input reads that can be aligned to the reference genome/transcriptome can range between 50% and 90%. Alignment rate is an important quality index of an RNA-seq library and of high-throughput sequencing in general. However, it also depends heavily on the experimental material and protocol, so it is hard to predefine a cutoff for a “high” alignment rate that applies to all data sets. On the other hand, the consistency of alignment rates between samples of the same data set is at least equally important. Inconsistent alignment rates are usually the consequence of systematic bias somewhere in the experimental procedure. They add unwanted between-sample variance to the data and might have a profound impact on statistical analyses, such as differential gene expression. Therefore, the focus of this analysis is whether any libraries have much lower alignment rates than the others.
The rate of unique vs. multiple alignment is a similar index of data quality. A high percentage of multiple alignment might indicate low complexity of sequence reads, a higher sequencing error rate, or other issues. This analysis also evaluates the consistency of unique vs. multiple alignment between samples.
Figure 1. The global alignment rate (left) and the rate of unique vs. multiple alignment (right). Each point represents an RNA-seq library and is colored by its number of sigma. For each library, a linear model is fitted to all the other libraries, and sigma (the standard deviation of the random error) is obtained from that model. The number of sigma is then calculated by dividing the difference between the observed and predicted values of that library by sigma.
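The leave-one-out calculation described in the caption of Figure 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `n_sigma`, the choice of total reads as predictor and aligned reads as response, and the example values are all hypothetical, and the report's own implementation may differ.

```r
# For each library i, fit a linear model on all other libraries, predict
# library i, and divide the observed-minus-predicted difference by the
# residual standard error (sigma) of that model.
# n_sigma is a hypothetical helper written for this illustration.
n_sigma <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 4)
  vapply(seq_along(x), function(i) {
    train <- data.frame(x = x[-i], y = y[-i])
    fit   <- lm(y ~ x, data = train)
    pred  <- predict(fit, newdata = data.frame(x = x[i]))
    unname((y[i] - pred) / summary(fit)$sigma)
  }, numeric(1))
}

# Illustrative data: total reads (millions) and aligned reads (millions).
# Library 5 is given a deliberately low alignment rate.
total   <- c(30, 32, 34, 36, 38, 40, 42, 44)
noise   <- c(0.05, -0.04, 0.03, -0.05, 0, 0.04, -0.03, 0.02)
aligned <- 0.93 * total + noise
aligned[5] <- 0.80 * total[5]

z <- n_sigma(total, aligned)
flagged <- which(abs(z) > 3)  # the cutoff of 3 sigma is an arbitrary choice
```

Because each library is excluded from the model used to judge it, a single outlier cannot inflate its own sigma and mask itself, although it does inflate sigma for the other libraries.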
An important aspect of processing RNA-seq data is aligning sequence reads across splicing sites, called gap alignment. Most commonly, STAR performs gap alignment first by using the known splicing sites from the reference transcriptome and then by detecting novel splicing sites based on the reference genome. Most splicing sites have canonical donor/acceptor bases, such as GT/AG. While non-canonical splicing sites have been observed, they are relatively rare and often suggestive of false positives.
Figure 2. The total number of gap-aligned reads and the number of gap-aligned reads with non-canonical splicing sites are fitted to linear models as in Figure 1. Averaged over all samples in this data set, 1.128% of all gap-aligned reads have non-canonical splicing sites.
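The averaged percentage quoted in the caption of Figure 2 can be illustrated with a small calculation. The sketch below assumes the average is the mean of per-sample percentages, which the report does not state explicitly; the data frame `splices` and its counts are hypothetical.

```r
# Hypothetical per-sample counts (illustrative values only).
splices <- data.frame(
  total         = c(12000000, 15000000, 10000000),  # all gap-aligned
  non_canonical = c(120000, 180000, 110000)         # non-canonical sites
)

# Per-sample percentage, then the average across samples.
pct_non_canonical <- 100 * splices$non_canonical / splices$total
mean_pct <- mean(pct_non_canonical)
```

Pooling the counts first (`100 * sum(non_canonical) / sum(total)`) would weight larger libraries more heavily and can give a slightly different number.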
STAR alignment reports the frequency of mismatch, deletion, and insertion bases. The consistency of these statistics should also be evaluated.
Figure 3. Distribution of insertion/deletion/mismatch frequency in all samples.
STAR alignment also reports the percentage of reads left unmapped for different reasons, including too many mismatches, too short, and other. Additionally, it reports the percentage of reads that were mapped, but to too many loci. Again, the focus here is the consistency of these percentages between samples.
Figure 4. Distribution of the frequency of poorly aligned reads due to different reasons. The frequency is relative to all mapped reads in the first plot, and relative to all unmapped reads in the others.
Listed below are samples with potential quality problems, based on the consistency of summary statistics between samples:
Click links to view full tables of summary statistics of all samples:
Check out the RoCA home page for more information.
To reproduce this report:
Find the data analysis template you want to use and an example of its paired YAML file here, and download the YAML example to your working directory.
To generate a new report using your own input data and parameters, edit the following items in the YAML file:
Run the code below within the R console or RStudio, preferably in a new R session:
if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }
CreateReport(filename.yaml); # filename.yaml is the YAML file you just downloaded and edited for your analysis
If no errors are reported, go to the output folder and open the index.html file to view the report.
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pander_0.6.0 htmlwidgets_0.7 DT_0.2
## [4] RoCA_0.0.0.9000 yaml_2.1.13 rmarkdown_1.3
## [7] knitr_1.14 flexclust_1.3-4 modeltools_0.2-21
## [10] lattice_0.20-33 clusterSim_0.45-1 MASS_7.3-45
## [13] clValid_0.6-6 som_0.3-5.1 cluster_2.0.4
## [16] gplots_3.0.1 awsomics_0.0.0.9000 RCurl_1.95-4.8
## [19] bitops_1.0-6 devtools_1.12.0 plotly_4.5.2
## [22] ggplot2_2.1.0
##
## loaded via a namespace (and not attached):
## [1] rgl_0.96.0 Rcpp_0.12.6 tidyr_0.5.1
## [4] class_7.3-14 gtools_3.5.0 assertthat_0.1
## [7] rprojroot_1.2 digest_0.6.10 mime_0.5
## [10] R6_2.1.2 plyr_1.8.4 backports_1.0.3
## [13] evaluate_0.9 e1071_1.6-7 highr_0.6
## [16] httr_1.2.1 lazyeval_0.2.0 curl_1.2
## [19] gdata_2.17.0 webshot_0.3.2 stringr_1.0.0
## [22] munsell_0.4.3 shiny_0.13.2 httpuv_1.3.3
## [25] base64enc_0.1-3 htmltools_0.3.5 tibble_1.1
## [28] viridisLite_0.1.3 dplyr_0.5.0 withr_1.0.2
## [31] jsonlite_1.0 xtable_1.8-2 gtable_0.2.0
## [34] DBI_0.4-1 git2r_0.15.0 magrittr_1.5
## [37] formatR_1.4 scales_0.4.0 KernSmooth_2.23-15
## [40] stringi_1.1.1 dbscan_0.9-8 modeest_2.1
## [43] tools_3.2.2 ade4_1.7-4 purrr_0.2.2
## [46] parallel_3.2.2 colorspace_1.2-6 caTools_1.17.1
## [49] R2HTML_2.3.2 memoise_1.0.0